Learning Dissimilarities for Categorical Symbols

Authors

  • Jierui Xie
  • Boleslaw K. Szymanski
  • Mohammed J. Zaki
Abstract

In this paper we learn a dissimilarity measure for categorical data that supports effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error of a nearest-neighbor-based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; and 3) our enhanced nearest neighbor classifier (called LD), based on the new space, is competitive with classifiers such as decision trees, RBF neural networks, Naïve Bayes, and support vector machines on a range of categorical datasets.
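
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the update rule, so the code below assumes a simple pull/push heuristic (move a point's symbol values toward its nearest same-class neighbor and away from its nearest different-class neighbor) and uses leave-one-out 1-nearest-neighbor error as the guiding signal. All function names, the learning rate, the epoch count, and the toy data are hypothetical.

```python
import random


def init_mapping(rows, seed=0):
    """Give every symbol of every categorical feature a random real value."""
    rng = random.Random(seed)
    mapping = []
    for j in range(len(rows[0])):
        symbols = sorted({row[j] for row in rows})
        mapping.append({s: rng.random() for s in symbols})
    return mapping


def embed(row, mapping):
    """Map a categorical row to a real-valued vector."""
    return [mapping[j][s] for j, s in enumerate(row)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_error(rows, labels, mapping):
    """Leave-one-out 1-NN error rate in the mapped space."""
    pts = [embed(r, mapping) for r in rows]
    wrong = 0
    for i, p in enumerate(pts):
        others = [k for k in range(len(pts)) if k != i]
        nearest = min(others, key=lambda k: sq_dist(p, pts[k]))
        wrong += labels[nearest] != labels[i]
    return wrong / len(pts)


def learn_mapping(rows, labels, epochs=50, lr=0.05, seed=0):
    """Assumed heuristic: pull toward same-class, push from other-class."""
    mapping = init_mapping(rows, seed)
    best = [dict(m) for m in mapping]
    best_err = loo_error(rows, labels, mapping)
    n = len(rows)
    for _ in range(epochs):
        # Coordinates are snapshotted once per epoch; updates within the
        # epoch are computed against this snapshot.
        pts = [embed(r, mapping) for r in rows]
        for i, p in enumerate(pts):
            same = [k for k in range(n) if k != i and labels[k] == labels[i]]
            diff = [k for k in range(n) if labels[k] != labels[i]]
            if not same or not diff:
                continue
            hit = min(same, key=lambda k: sq_dist(p, pts[k]))
            miss = min(diff, key=lambda k: sq_dist(p, pts[k]))
            # Updating a symbol's value moves every point that shares it.
            for j, s in enumerate(rows[i]):
                mapping[j][s] += lr * (pts[hit][j] - p[j])   # pull together
                mapping[j][s] -= lr * (pts[miss][j] - p[j])  # push apart
        err = loo_error(rows, labels, mapping)
        if err < best_err:  # keep the lowest-error assignment seen so far
            best, best_err = [dict(m) for m in mapping], err
    return best, best_err


# Hypothetical toy data: two categorical features, two classes.
rows = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("blue", "large"), ("green", "small"), ("green", "large")]
labels = ["a", "a", "b", "b", "a", "a"]

learned, err = learn_mapping(rows, labels)
print(learned, err)
```

Because the sketch keeps the best assignment seen so far, the returned error is never worse than that of the random initial assignment; whether it actually improves depends on the data and on the assumed step size.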


Similar resources

Discrepancy Analysis of Complex Objects Using Dissimilarities

In this article we consider objects for which we have a matrix of dissimilarities and we are interested in their links with covariates. We focus on state sequences for which pairwise dissimilarities are given for instance by edit distances. The methods discussed apply however to any kind of objects and measures of dissimilarities. We start with a generalization of the analysis of variance (ANOV...


Creating Algorithmic Symbols to Enhance Learning English Grammar

This paper introduces a set of English grammar symbols that the author has developed to enhance students’ understanding and consequently, application of the English grammar rules. A pretest-posttest control-group design was carried out in which the samples were students in two girls’ senior high schools (N=135, P ≤ 0.05) divided into two groups: the Treatment which received gramm...


An association-based dissimilarity measure for categorical data

In this paper, we propose a novel method to measure the dissimilarity of categorical data. The key idea is to consider the dissimilarity between two categorical values of an attribute as a combination of dissimilarities between the conditional probability distributions of other attributes given these two values. Experiments with real data show that our dissimilarity estimation method improves t...


Exploring Sequential Data

The tutorial is devoted to categorical sequence data describing for instance the successive buys of customers, working states of devices, visited web pages, or professional careers. Addressed topics include the rendering of state and event sequences, longitudinal characteristics of sequences, measuring pairwise dissimilarities and dissimilarity-based analysis of sequence data such as clustering...


On-line relational and multiple relational SOM

In some applications and in order to address real-world situations better, data may be more complex than simple numerical vectors. In some examples, data can be known only through their pairwise dissimilarities or through multiple dissimilarities, each of them describing a particular feature of the data set. Several variants of the Self Organizing Map (SOM) algorithm were introduced to generali...




Publication date: 2010